ABSTRACT


The aim of this work is to recognize handwritten characters of Telugu, an Indian language. Classifying similar
Telugu characters in a single stage leads to a low recognition rate; in the current work, similar characters are
therefore recognized in two stages. Various preprocessing steps are first carried out to extract characters from
the handwritten documents, and features are then extracted from the preprocessed characters. These features are
used in the proposed two-stage classification, in which the characters misclassified in the first stage are fed to
a second classifier. The recognition rates obtained with the two-stage system are better than those of the
single-stage classification system.

INTRODUCTION

Extensive work has been reported on printed text, whereas
relatively little research has been reported on handwritten
text [1,2,3,4,5,6,7,8,9,10]. Comprehensive surveys on
handwritten character recognition are reported in
[11,12,13,14]. Relatively little research exists on South
Indian languages such as Tamil, Kannada and Telugu
[14,15,16,17,18,19,20,21].

There are several benchmark datasets available for Latin 
numerals such as MNIST, CEDAR, NIST and CENPARMI 
[22]. The standard dataset available for English alphabets is 
UNIPEN. A few Chinese standard/benchmark handwritten 
databases are as follows: 

• HCL 2000 for Chinese alphabetical characters. 

• ETL8B and ETL9B datasets, comprising 956 and 3036
character classes, respectively.

• SCUT-IRAC for Chinese numerals. 

• CASIA-HWDB 1.1 for Chinese alphabets, numerals 
and punctuation marks. 

A few small standard datasets are available for Devanagari
(an Indian script), such as V2DMCHAR and ISIDCHAR. No
standard database is available for other Indian scripts on
which to conduct tests, and this is the major obstacle to
research on Indian scripts [23]. All the earlier studies report
results on small datasets collected in a laboratory environment.


It is evident from the literature survey [23] that no standard
dataset of Indian languages is readily available for research.
Hence, a dataset has to be developed in the laboratory
environment for any Indian language [23]. Therefore, the first
stage of the present work is to build a handwritten Telugu
character dataset.

Most Telugu characters are similar to one another, and
recognizing such similar characters is a highly challenging
task. The script has 16 vowels and 36 consonants.

This paper deals with handwritten character recognition for
Telugu script written on paper documents. It covers the
methodology used to build the handwritten Telugu database,
the preprocessing steps, the feature extraction methods and
the classifiers involved in the current work.

To develop the dataset, scribers of different age groups wrote
on paper documents. These documents are scanned at 300 dpi
and stored on the hard disk of the computer system. In the
next step, preprocessing operations are performed to extract
the characters. Feature extraction algorithms, namely
‘cell-wise pixel count’ and ‘histogram profile’, are then
employed to extract features from the preprocessed character
images. In the proposed two-stage classification system, the
classifiers employed to classify the characters are k-NN
(k-Nearest Neighbor) and SVM (Support Vector Machine).

Data collection and preprocessing 

Owing to the lack of a standard dataset for conducting
experiments on handwritten Telugu characters [23], the data is
collected from scribers of different age groups in the
laboratory environment. Isolated characters written on
high-quality paper are collected from 360 individuals to
develop the handwritten Telugu character set. The number of
basic handwritten Telugu characters considered in this work
is 50, which amounts to 18,000 samples in total (50 x 360). All
the documents collected from the scribers are scanned at
300 dpi and stored as images.

The preprocessed character samples are divided into V folds.
Each fold contains characters written by an equal number of
scribers. To test the t-th fold, the remaining (V-1) folds are
used for training. The average of the classification rates
obtained from all these folds is taken as the classification
rate/recognition accuracy of the model.

The number of characters considered for simulation is
18,000, covering 50 different classes and written by 360
different scribers. The number of samples per class is
therefore 360. All the characters are cross-validated by
dividing them into 8 folds. Each fold contains characters
written by 45 different scribers, so the number of samples in
each fold is 2,250, i.e., 50 x 45 (where 50 is the number of
classes and 45 is the number of scribers). To test a fold of
characters, the remaining 15,750 characters are used for
training. These preprocessed character images are used in the
proposed step-by-step algorithm discussed below.
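The fold arithmetic above can be checked with a minimal sketch. It assumes that scribers are numbered 0 to 359 and are assigned to folds in contiguous blocks of 45; the text does not say how scribers were assigned, and the function names are hypothetical.

```python
def scriber_folds(num_scribers=360, num_folds=8):
    """Partition scriber IDs into equal, disjoint folds (contiguous blocks: an assumption)."""
    if num_scribers % num_folds != 0:
        raise ValueError("scribers must divide evenly into folds")
    per_fold = num_scribers // num_folds
    return [list(range(f * per_fold, (f + 1) * per_fold)) for f in range(num_folds)]


def split_for_fold(samples, folds, t):
    """samples: list of (scriber_id, class_id) pairs. Fold t is held out for testing."""
    test_scribers = set(folds[t])
    test = [s for s in samples if s[0] in test_scribers]
    train = [s for s in samples if s[0] not in test_scribers]
    return train, test


NUM_CLASSES = 50
# One sample per (scriber, class) pair: 360 x 50 = 18,000 samples.
samples = [(w, c) for w in range(360) for c in range(NUM_CLASSES)]
folds = scriber_folds()
train, test = split_for_fold(samples, folds, t=0)
print(len(folds), len(folds[0]), len(test), len(train))  # 8 45 2250 15750
```

Splitting by scriber rather than by sample keeps every writer's characters entirely in either the training set or the test set for a given fold.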

RESEARCH METHODOLOGY 

The flowchart of the proposed two-stage classification
strategy for handwritten Telugu characters is shown in Figure 1.
After the noise removal and character extraction phases, the
preprocessed character image is first transformed into useful
features. After the feature extraction stage, each character
image is represented as a feature vector.

Each fold of characters is tested in two stages. In the first
stage of classification, ‘Classifier-1’ is trained with the
training set to classify the characters in the test fold. If the
predicted class of a test character is the same as its actual
class, the character is said to be recognized. The Recognition
Accuracy (RA) of the t-th testing fold is computed from the
generated confusion matrix, as given in Equation (1).

$$\mathrm{RA} = \frac{CR_1}{\text{Total no. of characters tested}} \times 100\% \qquad (1)$$

where $CR_1$ is the number of characters correctly classified in
stage-1.


Based on the confusion matrix generated by ‘Classifier-1’
in the first stage, the most confusing Telugu characters are
identified and are classified in the second stage of
classification. The unrecognized characters of the t-th test
fold from the first stage are stored in a bin and are tested in
the second stage. To improve the character recognition rate,
these unrecognized characters are classified once again using
another classifier, ‘Classifier-2’, which is trained with the
same training set, as indicated in Figure 1. The overall
recognition accuracy (ORA) of the two-stage classification
system for the t-th testing fold is computed as given in
Equation (2).

$$\mathrm{ORA} = \frac{CR_1 + CR_2}{\text{Total no. of characters tested}} \times 100\% \qquad (2)$$

where $CR_2$ is the number of characters correctly classified in stage-2.
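As a worked illustration of Equations (1) and (2), the sketch below computes RA and ORA from per-character predictions. The function name and the toy label lists are hypothetical and are not data from this work.

```python
def two_stage_accuracy(true_labels, stage1_pred, stage2_pred):
    """Return (RA, ORA) in percent, following Equations (1) and (2).

    stage2_pred[i] is consulted only when stage 1 misclassified sample i.
    """
    total = len(true_labels)
    cr1 = 0  # characters correctly classified in stage-1
    cr2 = 0  # stage-1 misses that stage-2 classifies correctly
    for y, p1, p2 in zip(true_labels, stage1_pred, stage2_pred):
        if p1 == y:
            cr1 += 1
        elif p2 == y:
            cr2 += 1
    ra = 100.0 * cr1 / total
    ora = 100.0 * (cr1 + cr2) / total
    return ra, ora


# Toy example (made-up labels): stage 1 gets 7 of 10 right,
# and stage 2 recovers 2 of the 3 misses.
true_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
stage1_pred = [0, 1, 2, 3, 4, 5, 6, 0, 0, 0]
stage2_pred = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]
ra, ora = two_stage_accuracy(true_labels, stage1_pred, stage2_pred)
print(ra, ora)  # 70.0 90.0
```

Because the second stage only ever sees characters that the first stage got wrong, ORA can never be lower than RA under this definition.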



Figure 1: Flowchart of proposed Multi-stage classification 
strategy. 

The overall classification rate is improved with this
two-stage classification strategy. The procedure is repeated
for all the V folds, and the average recognition accuracy over
these folds is taken as the recognition accuracy of the model.
The features extracted for the two-stage classification
strategy are as follows.

FEATURE EXTRACTION 

a. Cell-wise pixel count: The image, I, is divided into cells.
The number of cells obtained from an M x M character pattern
is $M^2/n^2$, where n x n is the size of a cell. The number of
object (foreground) pixels is counted in each cell/zone, i.e.,
the distribution of pixels over the cells is used as the feature
set to classify the handwritten Telugu characters.
For each cell/zone, say $z_{11}$, the foreground pixels are
summed up and the sum is taken as one feature. This procedure
is repeated for the other cells. The features computed from
these cells/zones of the character image are concatenated to
form a feature vector, represented by
$C_f = [z_{11}, z_{12}, z_{13}, \ldots, z_{54}, z_{55}]$.
Hence, for an image, I, 25 features are extracted in the
proposed work (for M = 50, which corresponds to n = 10). In
this way, a feature set consisting of 25 features is extracted
for every image in the database.
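The cell-wise pixel count described above can be sketched as follows. This is a minimal illustration, assuming a binary image stored as a list of rows with 1 for foreground and 0 for background, and cells visited in row-major order; the function name is hypothetical.

```python
def cell_pixel_counts(image, cell_size=10):
    """Count foreground pixels in each cell_size x cell_size cell of an M x M image."""
    m = len(image)
    if any(len(row) != m for row in image) or m % cell_size != 0:
        raise ValueError("expected an M x M image with M divisible by cell_size")
    features = []
    for top in range(0, m, cell_size):          # cell rows: z_1j, z_2j, ...
        for left in range(0, m, cell_size):     # cells within a row: z_i1, z_i2, ...
            count = sum(
                image[r][c]
                for r in range(top, top + cell_size)
                for c in range(left, left + cell_size)
            )
            features.append(count)
    return features


# Toy 50 x 50 image: a filled 10 x 10 block straddling the first two cell rows.
image = [[0] * 50 for _ in range(50)]
for r in range(5, 15):
    for c in range(0, 10):
        image[r][c] = 1
features = cell_pixel_counts(image)
print(len(features), features[0], features[5])  # 25 50 50
```

With M = 50 and a 10 x 10 cell, the function returns $50^2/10^2 = 25$ counts, matching the feature vector length stated above.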

b. Histogram profile: The flowchart of the histogram profile
is shown in Figure 2. As described in the flowchart, the
histograms of the character image are computed along four
directions. All these profiles are appended to form a feature
vector of size 298 for a normalized character image of size
50 x 50.
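One way to obtain a 298-element vector from a 50 x 50 image is sketched below. It assumes the four directions are horizontal, vertical and the two diagonals; this is an assumption, since the text does not name them, but it is consistent with the stated size because 50 + 50 + 99 + 99 = 298. The function name is hypothetical.

```python
def histogram_profile(image):
    """Concatenate foreground-pixel histograms along four directions (assumed)."""
    m = len(image)
    horizontal = [sum(row) for row in image]                             # one bin per row
    vertical = [sum(image[r][c] for r in range(m)) for c in range(m)]    # one bin per column
    diagonal = [0] * (2 * m - 1)        # top-left to bottom-right diagonals
    anti_diagonal = [0] * (2 * m - 1)   # top-right to bottom-left diagonals
    for r in range(m):
        for c in range(m):
            diagonal[c - r + m - 1] += image[r][c]
            anti_diagonal[r + c] += image[r][c]
    return horizontal + vertical + diagonal + anti_diagonal


# Toy 50 x 50 image with foreground pixels on the main diagonal only.
image = [[1 if r == c else 0 for c in range(50)] for r in range(50)]
features = histogram_profile(image)
print(len(features))  # 298
```

In this toy image all 50 foreground pixels fall into a single bin of the main-diagonal profile, while the other three profiles spread them one per bin.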



Figure 2: Flowchart of histogram profile. 

RESULTS 

The two-stage classification model is developed in MATLAB
2014a on a system with an i5 processor (2.2 GHz CPU clock
speed), 4 GB RAM and a 64-bit Windows 8.1 operating system.
The number of characters considered for simulation is 18,000
from 50 different classes, written by 360 different scribers,
so the number of samples per class is 360. All the characters
are cross-validated by dividing them into training and testing
sets. Each fold contains characters written by 45 different
scribers. The number of samples in each fold is 2,250, i.e.,
50 x 45 (where 50 is the number of classes and 45 is the number
of scribers). To test a fold of characters, the remaining
15,750 characters are used for training. With 8-fold
cross-validation, every character is tested exactly once while
the training and testing sets remain disjoint. The average of
the classification rates obtained from all these folds is taken
as the classification rate/recognition accuracy (RA) of the
model.

In the first stage, the characters are tested using the k-NN
(k-nearest neighbor) classifier, and the character images left
unrecognized in this stage are forwarded to the second stage
for classification. In the second stage, an SVM (Support Vector
Machine) classifier is trained with the same training set to
classify only the unrecognized characters from the test set.
This is done to improve the recognition accuracy and to reduce
the misclassification rate. The recognition accuracies obtained
with the two-stage classification system are tabulated in
Table 1.
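The two-stage test procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the MATLAB implementation used in this work: stage 1 is a plain k-NN written from scratch, and stage 2 uses a nearest-centroid rule as a simple stand-in for the SVM so that the example needs no external library. All function names and the toy data are hypothetical. As in the text, a test character is forwarded to stage 2 when its stage-1 prediction differs from its actual class.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k=1):
    """Stage 1: majority vote among the k nearest training vectors."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, x)), ty)
        for tx, ty in zip(train_x, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def centroid_predict(train_x, train_y, x):
    """Stage 2 stand-in (NOT an SVM): assign x to the nearest class centroid."""
    sums, counts = {}, Counter(train_y)
    for tx, ty in zip(train_x, train_y):
        acc = sums.setdefault(ty, [0.0] * len(tx))
        for i, v in enumerate(tx):
            acc[i] += v

    def sq_dist(label):
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=sq_dist)


def two_stage_counts(train_x, train_y, test_x, test_y, stage1, stage2):
    """Return (CR1, CR2): correct in stage 1, and recovered by stage 2."""
    cr1 = cr2 = 0
    for x, y in zip(test_x, test_y):
        if stage1(train_x, train_y, x) == y:
            cr1 += 1
        elif stage2(train_x, train_y, x) == y:  # only stage-1 misses reach here
            cr2 += 1
    return cr1, cr2


# Toy 1-D data: the class-0 sample at 9.0 misleads 1-NN for the test point 8.6.
train_x = [[0.0], [1.0], [2.0], [9.0], [10.0], [11.0], [12.0]]
train_y = [0, 0, 0, 0, 1, 1, 1]
test_x = [[1.5], [11.2], [8.6]]
test_y = [0, 1, 1]
cr1, cr2 = two_stage_counts(train_x, train_y, test_x, test_y,
                            knn_predict, centroid_predict)
print(cr1, cr2)  # 2 1
```

Both classifiers are passed in as callables trained on the same training set, so a real SVM could be substituted for the stand-in without changing `two_stage_counts`.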


Table 1: Two-stage results obtained using k-NN and SVM classifiers

Feature extraction       % Recognition accuracy
                         k-NN     SVM      k-NN+SVM
Cell-wise pixel count    82.3     88.4     90.8
Histogram profile        73.5     77.9     79.3


DISCUSSIONS 

It is evident from Table 1 that the two-stage classification
framework significantly improves the recognition rates. Within
this framework, the cell-wise pixel count approach recognizes
the handwritten Telugu characters considerably better than the
histogram profile-based approach (90.8% versus 79.3%). An
improvement of 4-5% in recognition accuracy is obtained using
the two-stage framework with both the feature extraction
approaches.

CONCLUSIONS 

There is no standard dataset of Indian scripts for conducting
experiments on handwritten character recognition. Hence, a
dataset containing 18,000 isolated basic handwritten Telugu
characters is developed in this work. The feature extraction
algorithms employed for character recognition are cell-wise
pixel count and histogram profile. The performance of these
feature sets is tested with the proposed two-stage
classification system.

An improvement of 3-5% in recognition rate is achieved
with the proposed two-stage classification system compared
to single-stage classification for the feature extraction
approaches considered. The best recognition accuracy
obtained using the proposed two-stage classification
framework is 90.8%, with the ‘cell-wise pixel count’ feature set.